Region of Murcia
Overview of the 17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management
IC3K 2025 (17th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management) received 163 paper submissions from 40 countries. To evaluate each submission, a double-blind paper review was performed by the Program Committee. After a stringent selection process, 31 papers were published and presented as full papers, i.e. completed work (12 pages/25' oral presentation), 81 papers were accepted as short papers (54 as oral presentation). The organizing committee included the IC3K Conference Chairs: Ricardo da Silva Torres, Artificial Intelligence Group, Wageningen University & Research, Netherlands and Jorge Bernardino, Polytechnic University of Coimbra, Portugal, and the IC3K 2025 Program Chairs: Le Gruenwald, University of Oklahoma, School of Computer Science, United States, Frans Coenen, University of Liverpool, United Kingdom, Jesualdo Tomás Fernández-Breis, University of Murcia, Spain, Lars Nolle, Jade University of Applied Sciences, Germany, Elio Masciari, University of Napoli Federico II, Italy and David Aveiro, University of Madeira, NOVA-LINCS and ARDITI, Portugal. At the closing session, the conference acknowledged a few papers that were considered excellent in their class, presenting a "Best Paper Award", "Best Student Paper Award", and "Best Poster Award" for each of the co-located conferences.
Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case
Riquelme, Alvaro, Costa, Pedro, Martinez, Catalina
For years, semantic interoperability standards have sought to streamline the exchange of clinical data, yet their deployment remains time-consuming, resource-intensive, and technically challenging. To address this, we introduce a semi-automated approach that leverages large language models specifically GPT-4o and Llama 3.2 405b to convert structured clinical datasets into HL7 FHIR format while assessing accuracy, reliability, and security. Applying our method to the MIMIC-IV database, we combined embedding techniques, clustering algorithms, and semantic retrieval to craft prompts that guide the models in mapping each tabular field to its corresponding FHIR resource. In an initial benchmark, resource identification achieved a perfect F1-score, with GPT-4o outperforming Llama 3.2 thanks to the inclusion of FHIR resource schemas within the prompt. Under real-world conditions, accuracy dipped slightly to 94 %, but refinements to the prompting strategy restored robust mappings. Error analysis revealed occasional hallucinations of non-existent attributes and mismatches in granularity, which more detailed prompts can mitigate. Overall, our study demonstrates the feasibility of context-aware, LLM-driven transformation of clinical data into HL7 FHIR, laying the groundwork for semi-automated interoperability workflows. Future work will focus on fine-tuning models with specialized medical corpora, extending support to additional standards such as HL7 CDA and OMOP, and developing an interactive interface to enable expert validation and iterative refinement.
FILM: Framework for Imbalanced Learning Machines based on a new unbiased performance measure and a new ensemble-based technique
Guillén-Teruel, Antonio, Caracena, Marcos, Pardo, Jose A., de-la-Gándara, Fernando, Palma, José, Botía, Juan A.
This research addresses the challenges of handling unbalanced datasets for binary classification tasks. In such scenarios, standard evaluation metrics are often biased by the disproportionate representation of the minority class. Conducting experiments across seven datasets, we uncovered inconsistencies in evaluation metrics when determining the model that outperforms others for each binary classification problem. This justifies the need for a metric that provides a more consistent and unbiased evaluation across unbalanced datasets, thereby supporting robust model selection. To mitigate this problem, we propose a novel metric, the Unbiased Integration Coefficients (UIC), which exhibits significantly reduced bias ($p < 10^{-4}$) towards the minority class compared to conventional metrics. The UIC is constructed by aggregating existing metrics while penalising those more prone to imbalance. In addition, we introduce the Identical Partitions for Imbalance Problems (IPIP) algorithm for imbalanced ML problems, an ensemble-based approach. Our experimental results show that IPIP outperforms other baseline imbalance-aware approaches using Random Forest and Logistic Regression models in three out of seven datasets as assessed by the UIC metric, demonstrating its effectiveness in addressing imbalanced data challenges in binary classification tasks. This new framework for dealing with imbalanced datasets is materialized in the FILM (Framework for Imbalanced Learning Machines) R Package, accessible at https://github.com/antoniogt/FILM.
GONet: A Generalizable Deep Learning Model for Glaucoma Detection
Abramovich, Or, Pizem, Hadas, Fhima, Jonathan, Berkowitz, Eran, Gofrit, Ben, Meisel, Meishar, Baskin, Meital, Van Eijgen, Jan, Stalmans, Ingeborg, Blumenthal, Eytan Z., Behar, Joachim A.
Glaucomatous optic neuropathy (GON) is a prevalent ocular disease that can lead to irreversible vision loss if not detected early and treated. The traditional diagnostic approach for GON involves a set of ophthalmic examinations, which are time-consuming and require a visit to an ophthalmologist. Recent deep learning models for automating GON detection from digital fundus images (DFI) have shown promise but often suffer from limited generalizability across different ethnicities, disease groups and examination settings. To address these limitations, we introduce GONet, a robust deep learning model developed using seven independent datasets, including over 119,000 DFIs with gold-standard annotations and from patients of diverse geographic backgrounds. GONet consists of a DINOv2 pre-trained self-supervised vision transformers fine-tuned using a multisource domain strategy. GONet demonstrated high out-of-distribution generalizability, with an AUC of 0.85-0.99 in target domains. GONet performance was similar or superior to state-of-the-art works and was significantly superior to the cup-to-disc ratio, by up to 21.6%. GONet is available at [URL provided on publication]. We also contribute a new dataset consisting of 768 DFI with GON labels as open access.
S-VOTE: Similarity-based Voting for Client Selection in Decentralized Federated Learning
Sánchez, Pedro Miguel Sánchez, Beltrán, Enrique Tomás Martínez, Feng, Chao, Bovet, Gérôme, Pérez, Gregorio Martínez, Celdrán, Alberto Huertas
Decentralized Federated Learning (DFL) enables collaborative, privacy-preserving model training without relying on a central server. This decentralized approach reduces bottlenecks and eliminates single points of failure, enhancing scalability and resilience. However, DFL also introduces challenges such as suboptimal models with non-IID data distributions, increased communication overhead, and resource usage. Thus, this work proposes S-VOTE, a voting-based client selection mechanism that optimizes resource usage and enhances model performance in federations with non-IID data conditions. S-VOTE considers an adaptive strategy for spontaneous local training that addresses participation imbalance, allowing underutilized clients to contribute without significantly increasing resource costs. Extensive experiments on benchmark datasets demonstrate the S-VOTE effectiveness. More in detail, it achieves lower communication costs by up to 21%, 4-6% faster convergence, and improves local performance by 9-17% compared to baseline methods in some configurations, all while achieving a 14-24% energy consumption reduction. These results highlight the potential of S-VOTE to address DFL challenges in heterogeneous environments.
Permutation-based multi-objective evolutionary feature selection for high-dimensional data
Espinosa, Raquel, Sánchez, Gracia, Palma, José, Jiménez, Fernando
Feature selection is a critical step in the analysis of high-dimensional data, where the number of features often vastly exceeds the number of samples. Effective feature selection not only improves model performance and interpretability but also reduces computational costs and mitigates the risk of overfitting. In this context, we propose a novel feature selection method for high-dimensional data, based on the well-known permutation feature importance approach, but extending it to evaluate subsets of attributes rather than individual features. This extension more effectively captures how interactions among features influence model performance. The proposed method employs a multi-objective evolutionary algorithm to search for candidate feature subsets, with the objectives of maximizing the degradation in model performance when the selected features are shuffled, and minimizing the cardinality of the feature subset. The effectiveness of our method has been validated on a set of 24 publicly available high-dimensional datasets for classification and regression tasks, and compared against 9 well-established feature selection methods designed for high-dimensional problems, including the conventional permutation feature importance method. The results demonstrate the ability of our approach in balancing accuracy and computational efficiency, providing a powerful tool for feature selection in complex, high-dimensional datasets.
Generative Artificial Intelligence-Supported Pentesting: A Comparison between Claude Opus, GPT-4, and Copilot
Martínez, Antonio López, Cano, Alejandro, Ruiz-Martínez, Antonio
The advent of Generative Artificial Intelligence (GenAI) has brought a significant change to our society. GenAI can be applied across numerous fields, with particular relevance in cybersecurity. Among the various areas of application, its use in penetration testing (pentesting) or ethical hacking processes is of special interest. In this paper, we have analyzed the potential of leading generic-purpose GenAI tools-Claude Opus, GPT-4 from ChatGPT, and Copilot-in augmenting the penetration testing process as defined by the Penetration Testing Execution Standard (PTES). Our analysis involved evaluating each tool across all PTES phases within a controlled virtualized environment. The findings reveal that, while these tools cannot fully automate the pentesting process, they provide substantial support by enhancing efficiency and effectiveness in specific tasks. Notably, all tools demonstrated utility; however, Claude Opus consistently outperformed the others in our experimental scenarios.
ProFe: Communication-Efficient Decentralized Federated Learning via Distillation and Prototypes
Sánchez, Pedro Miguel Sánchez, Beltrán, Enrique Tomás Martínez, Llamas, Miguel Fernández, Bovet, Gérôme, Pérez, Gregorio Martínez, Celdrán, Alberto Huertas
Decentralized Federated Learning (DFL) trains models in a collaborative and privacy-preserving manner while removing model centralization risks and improving communication bottlenecks. However, DFL faces challenges in efficient communication management and model aggregation within decentralized environments, especially with heterogeneous data distributions. Thus, this paper introduces ProFe, a novel communication optimization algorithm for DFL that combines knowledge distillation, prototype learning, and quantization techniques. ProFe utilizes knowledge from large local models to train smaller ones for aggregation, incorporates prototypes to better learn unseen classes, and applies quantization to reduce data transmitted during communication rounds. The performance of ProFe has been validated and compared to the literature by using benchmark datasets like MNIST, CIFAR10, and CIFAR100. Results showed that the proposed algorithm reduces communication costs by up to ~40-50% while maintaining or improving model performance. In addition, it adds ~20% training time due to increased complexity, generating a trade-off.
Integrated Water Resource Management in the Segura Hydrographic Basin: An Artificial Intelligence Approach
Otamendi, Urtzi, Maiza, Mikel, Olaizola, Igor G., Sierra, Basilio, Flores, Markel, Quartulli, Marco
Managing resources effectively in uncertain demand, variable availability, and complex governance policies is a significant challenge. This paper presents a paradigmatic framework for addressing these issues in water management scenarios by integrating advanced physical modelling, remote sensing techniques, and Artificial Intelligence algorithms. The proposed approach accurately predicts water availability, estimates demand, and optimizes resource allocation on both short- and long-term basis, combining a comprehensive hydrological model, agronomic crop models for precise demand estimation, and Mixed-Integer Linear Programming for efficient resource distribution. In the study case of the Segura Hydrographic Basin, the approach successfully allocated approximately 642 million cubic meters ($hm^3$) of water over six months, minimizing the deficit to 9.7% of the total estimated demand. The methodology demonstrated significant environmental benefits, reducing CO2 emissions while optimizing resource distribution. This robust solution supports informed decision-making processes, ensuring sustainable water management across diverse contexts. The generalizability of this approach allows its adaptation to other basins, contributing to improved governance and policy implementation on a broader scale. Ultimately, the methodology has been validated and integrated into the operational water management practices in the Segura Hydrographic Basin in Spain.
De-VertiFL: A Solution for Decentralized Vertical Federated Learning
Celdrán, Alberto Huertas, Feng, Chao, Banik, Sabyasachi, Bovet, Gerome, Perez, Gregorio Martinez, Stiller, Burkhard
Federated Learning (FL), introduced in 2016, was designed to enhance data privacy in collaborative model training environments. Among the FL paradigm, horizontal FL, where clients share the same set of features but different data samples, has been extensively studied in both centralized and decentralized settings. In contrast, Vertical Federated Learning (VFL), which is crucial in real-world decentralized scenarios where clients possess different, yet sensitive, data about the same entity, remains underexplored. Thus, this work introduces De-VertiFL, a novel solution for training models in a decentralized VFL setting. De-VertiFL contributes by introducing a new network architecture distribution, an innovative knowledge exchange scheme, and a distributed federated training process. Specifically, De-VertiFL enables the sharing of hidden layer outputs among federation clients, allowing participants to benefit from intermediate computations, thereby improving learning efficiency. De-VertiFL has been evaluated using a variety of well-known datasets, including both image and tabular data, across binary and multiclass classification tasks. The results demonstrate that De-VertiFL generally surpasses state-of-the-art methods in F1-score performance, while maintaining a decentralized and privacy-preserving framework.